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Abstract 

A new electronic structure model is developed in which the ground state energy of a molecular 
system is given by a Hartree-Fock-like expression with parametrized one- and two-electron integrals 
over an extended (minimal -|- polarization) set of orthogonalized atom-centered basis functions, the 
variational equations being solved formally within the minimal basis but the effect of polarization 
functions being included in the spirit of second-order perturbation theory. It is designed to yield 
good dipole polarizabilities and improved intermolecular potentials with dispersion terms. The 
molecular integrals include up to three-center one-electron and two-center two-electron terms, all 
in simple analytical forms. A method to extract the effective one-electron Hamiltonian of nonlocal- 
exchange Kohn-Sham theory from the coupled-cluster one-electron density matrix is designed and 
used to get its matrix representation in a molecule-intrinsic minimal basis as an input to the 
paramtrization procedure - making a direct link to the correlated wavefunction theory. The model 
has been trained for 15 elements (H, Li~F, Na-Cl, 720 parameters) on a set of 5581 molecules 
(including ions, transition states, and weakly-bound complexes) whose first- and second-order 
properties were computed by the coupled-cluster theory as a reference, and a good agreement is 
seen. The model looks promising for the study of large molecular systems, it is believed to be 
an important step forward from the traditional semiempirical models towards higher accuracy at 
nearly as low a computational cost. 
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I. INTRODUCTION 



Atomistic computer simulations of complex chemical systems and materials at quantita- 
tive level are a great challenge for modern science: not only a high accuracy of computed 
potential energy surfaces is needed for systems with many atoms, but also a higher speed 



of computation for a t 
force fields pioneered 



: lorough sampling of the configurational space. Molecular mechanical 
l| 44 years ago are still almost the only practical method of calcula- 
tion in many fields thanks to their very high speed and despite their well-known limitations. 
With their fixed bonding topology, they are designed first of all for conformational studies 
where nonbonding interactions play the main role, their treatment of electrostatics may be 
as sophisticated as to account for polarization effects [2]. Extensions to treat one (or few) 
chemical reaction steps need careful parameter adjustment for each active center studied. 
General reactive force fields [4, 5| with geometry-dependent bond orders and atomic charges 
seem to be a logical next step and already remind of models within the Hohenberg-Kohn 6| 
density functional theory (DFT). Our own experience with this kind of models lead us to 
believe that instead of designing more and more complicated charge and bond order func- 
tional it should be much easier to incorporate at least a single matrix diagonalization of an 
effective one-electron Hamiltonian into the model (likely in a linear-scaling fashion js, 9|), 
just like the mainstream DFT took the Kohn-Sham path [10], and thus we escape the force 
field and enter the realm of molecular quantum mechanics. 

At the other extreme, rigorous wavefunction methods give useful results starting with the 
second-order many-body perturbation theory (MP2) ll|], but much better with the coupled- 



cluster theory 



12| with single and double substitutions (CCSD) 



13[ or - as the golden 

standard - with the further perturbative account of triple substitutions (CCSD(T)) 1^ . 
Their fifth, sixth, and seventh power scaling of computational cost with the system size 
and huge prefactors, as well as the slow convergence of the computed properties with the 
basis set size, make them hopelessly slow for large molecules, but they are indispensable for 
getting small-molecule reference data for training and testing all kinds of models. 



Kohn-Sham DFT on its way from 



161, with further inclusion of a fraction 17| or, much better, the full long-range part 181 of the 

nn 

nonlocal exchange and the dispersion tail [19|,|20[, has grown into a rather accurate electronic 
structure model with favorable system-size scaling properties. In a standard implementation. 



oca l jlo| to generahzed-gradient approximations 



15 
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localized atom-centered basis functions are used to solve the self-consistent field (SCF) 
equations (plane wave techniques [2l| are limited to the less accurate pure density functional 
and cannot work as fast with the nonlocal exchange), the analytical evaluation of two- 
electron Coulomb integrals and the numerical integration of exchange-correlation terms are 
the bottleneck for smaller system sizes but can be made linear scaling starting from around 
1000 atoms, in that limit we estimate the computation of DFT energy functional to be about 
10^ to 10^ times slower than the most sophisticated polarizable force field energy and gradient 



evaluation. Density-fitting techniques [22|, |23| can speed up the calculation of these terms by 



up to 100 times, but only for the pure DFT. Pseudospectral methods [2^, |25| can show up to 
100 times speed-up also for the nonlocal-exchange DFT. Even if the integrals were for free, 
there would be another serious bottleneck in the linear algebra of SCF equations. A smallest 
meaningful atomic basis set with 5 functions for H and 14 or 18 for Li-Ne would yield little 
to no sparsity of the density matrix for typical three-dimensional molecules with up to 3000 



atoms 



26| and around 30000 basis functions, so about 10 floating-point multiplications 



and additions would be needed to do one matrix-multiply! In this regime, one SCF energy 
calculation can hardly take much less than one day even on a modern high-performance 
parallel computer. 

The only way for the electronic structure theory to compete with the force field methods 
lies in the use of a finite-dimensional model Hamiltonian defined by its matrix elements in 
a minimal atomic basis representation. This is an alternative strategy with the Kohn-Sham 
DFT - instead of defining the universal density functional in some limited (approximate, 
parametrized) form and then computing the arising integrals rigorously at each molecular 
geometry, these molecular integrals themselves can be directly modeled as the functions of 
atomic coordinates. One may note that the Coulomb potential of the nuclei in molecules 
is a very special class of external potentials for a system of electrons, and the widely-used 
density functionals are not as universal as thought, being often fitted 27|] to molecular 
data. The minimal basis representation is not a limitation - not a fixed free-atom but a 
molecule-adapted set of (deformed) atomic functions is implied. In our earlier work [28] we 
have shown how to extract such a basis set from an accurate (or exact) solution of Kohn- 
Sham equations at a given molecular geometry in a unique and molecule-intrinsic way. The 
effective local potential of Kohn-Sham theory can in turn be extracted from the correlated 
wavefunction theory by a stable numerical procedure 29| , in Section III Dl of this work we 
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derive its analog for the case of nonlocal exchange. (After we had done ours, we were made 



aware 



30| of a parallel work on the effective valence shell Hamiltonians 3lN34| but for a 



fully correlated - and not SCF - treatment of the ground and excited states). Now that the 
effective minimal-basis matrix elements of the model Hamiltonian have the first-principles 
foundation and can be computed numerically, they can be used to guide the work on their 
approximat ions . 

Into this category fall the semiempirical electronic structure models based on the ne- 
glect of diatomic differential overlap (NDDO) formalism that have had nearly half a century 



of conservative evolution. Conceived as the simplest approximations 



351 ] for valence-only 



minimal-basis SCF calculations with most multicenter integrals set to zero, they became a 



breakthrough in computational chemistry 



361] when a systematic parametrization 371-142] 



to fit experimental molecular data turned them into predictive phenomenological models. 
As the growing computer power allowed the DFT, MP2, and other rigorous methods to 
be applied to chemically-interesting systems and the standards of accuracy tightened, the 



semiempirica. 
were studied 



models began to lose the game. Some limited orthogonalization corrections 



43 



44| but lead only to a limited improvement. Poor intermolecular potentials. 



45L 
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46]by simple diatomic disper- 
49I in the spirit of molecular 



especially for hydrogen-bonding, were tried to be cured 
sion corrections |47| and further by triatomic corrections 
mechanics - by adding the terms that depend on the atomic coordinates only and not on the 
electronic state. A proper treatment of polarizability within the minimal-basis formalism is 
not straightforward, one solution [50|] is to add polarization functions into the basis but this 
slows down the calculations. A more attractive way was found 5ll, l52| in which a polariza- 
tion term borrowed from molecular mechanics is added on top of the SCF equations. With 
all these recent developments, the desired improvement in accuracy of the semiempirical 
NDDO models for many applications is still not achieved. 

Here we report our new parametrizable electronic structure model that evolves along a 
different line. We consider an extended atomic valence basis set that has radial and angular 
polarization functions added to the minimal set on each atom, and then we design a new 
two-layer SCF method (Section III Al) in which the variational equations are defined within 
the minimal subspace and the contribution from the polarization subspace is included in the 
spirit of second-order perturbation theory. It should be noted that our SCF method is fairly 



different in many ways from other known dual-basis techniques 53N55| . The parametriz 
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able molecular integrals of our model (Section III Bp have more general functional forms 
and include new terms not seen in the traditional semiempirical NDDO models. We add 
the long-range dispersion corrections into the two-electron integrals - a more meaningful 
treatment of the two-electron correlation effect that depends on the electronic state. Our 
model also treats the molecular polarizability in a natural and self-interaction-free way. The 
parametrization procedure (Section III Cp we use to train our model has a number of new 
terms added into the optimization process: the effective one-electron Hamiltonian matrices 
extracted (Section IIIDI and 28(]) from the correlated wavefunction theory, the molecular 
dispersion coefficients (Section III ED and the molecular electrostatic potentials on the sur- 
faces (Section III Fh . as well as the second-order response properties (force constant matrices, 
atomic polar tensors, and dipole polarizabilities) - all these values coming from the high- 
level (coupled-cluster) calculations are required to be reproduced (in the least-squares sense) 
by the parametrized SCF model. 

All this methodology defines our model up to the values of parameters that have to be 
optimized on a set of reference molecular data. Here we report (Section IIIII) a preliminary 
parametrization for 15 elements H, Li-F, Na-Cl on a very diverse set of molecular structures 
to assess the worst-case performance of the model - it stands the test, seems to be already 
usable and useful for some applications, and awaits a future extension to heavier elements. 



II. THEORY 



A. Energy expression 

In our model, the total energy of a molecular system with K nuclei at positions {r^}, 
k = 1, . . . , K, in an external field v{r) can be written as a sum of three terms 

E = Eo + Ei + E2, (1) 

the one independent of the valence electronic structure 

-Eo = ^Cfc + ^ qkQk'lrk - rk>\'^ + ^gfet^(rfe) (2) 

k k<k' k 

being the sum of core electron energies e^, the repulsion of atomic cores with charges {qk} 
and their interaction with the external field; the Hartree-Fock energy within the minimal 
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atomic basis 

-E'l = I ^ iHfj,u + F^ua) D^ua, (3) 



Ffiva — H -\- ^ ^ RiJ.i'crfM'u'a' -Df^'u'cr' J (4) 

the indices /i,!/ — 1, . . . ,M running over the minimal set, the spin label cr = ±|; and the 
second-oder term 

E2^Y.^iSaD,.a, (5) 

IJ,V(T 

^aixa ~ ^ _ ^ 1 \' ) 

that accounts for the effect of polarization functions (labeled by a — M -\- 1, . . . ,N) in a 
perturbative way. The one-electron density matrices 

F^ IxvcT ^ ^ C ixicr^ via (9) 
i=l 

with A^o- electrons for each spin are formed from the orthogonal coefficient matrices 

^ ] CixiaCf^ja — Sij. (10) 

The orthogonahzed atomic valence basis functions 

can also be labeled by their principal n, angular Z, and azimuthal m quantum numbers and 
atomic centers for each I no more than one n value is used in either minimal or polarization 
set. The one-electron integral matrix is a sum 

H^. ^T^. + Y1 Vit' + (12) 

k' 

of kinetic energy, effective core potential, and external field terms. The spin-dependent 
two-electron integrals 

Rnvaix'u'a' — Rfiun'v' ^aa' Rfj,' ufj,v' (13) 
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are built from the spinless Coulomb repulsion integrals by the proper antisymmetrization. 
The values < and £:q, > in the denominator of Eq. ([7]) are taken as atomic constants 
and play the role of diagonal energies of the zeroth-order problem of perturbation theory, 
their signs imply that E2 < 0. Had we had s^o- = 1 in Eq. as we did in an earlier 
model, the total energy E would be a cubic function of the density matrix, whereas the 
Hartree-Fock energy Ei alone is quadratic - the perturbative inclusion of higher-order basis 
set effects is leading to the appearance of effective three-electron integrals. Moreover, E2 
would be quadratic in the external field - as needed for the proper account of polarizability 
within a formally minimal-basis treatment. Our experience has shown, however, that the 
simplest choice of = 1 often leads to a collapse of E2 when atoms in a molecule get 
crowded - we have overcome this problem using the damping factors 



that need to be spherically averaged 

^AtfT = Y+~2/~ ^ ^ucT^hli^^k^k^ (15) 

before the insertion into Eq. (Q. 

It is noteworthy that our model has the integrals of only two types: either with all 
functions from the minimal set (all-minimal) as in Eq. (jl]) or with all but one from the 
minimal and one from the polarization set (minimal-polarization one-electron and 3-minimal- 
1-polarization two-electron integrals) as in Eq. (jH]). 

The energy expression ([1]) is minimized under the orthogonality constraints ffTOj) to get 
the ground-state solution of the electronic structure problem, the stationary conditions can 
be given in the form of self-consistent-field (SCF) one-electron equations 

{J^fMUa - SfMU^ia) Cyi^ = (16) 

V 

with the effective Hamiltonian matrix derived as 

T^,, = OE/dD^,,, (17) 

the explicit expression for the latter is lengthy but straightforward. To compute a one- 
electron property V such as dipole moment or molecular electrostatic potential in a consistent 
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way, the energy derivative formalism should be used leading to 

v = Y, y^^uD^,, + (18) 

with the minimal-polarization block of the effective density matrix given by the derivative 

D^, = dE/dH^,, (19) 

its explicit form is again lengthy but straightforward. 

We have written a computer code that solves these SCF equations to get the energy 
and its first and second derivatives with respect to the atomic coordinates and the applied 
uniform electric field, the derivatives being computed in a fully analytic way. 



B. Molecular integrals 

The one- and two-electron integrals in the energy expression ([1]) refer to the orthogonal- 
ized atom-centered basis functions (fTT!) and as such are nontrivial functions of (in general) all 
atomic coordinates of the molecular system. They could have been computed from the first 



principles using a three- level extension of our zero-bond-dipole orthogonalization scheme 28 1 
applied to a fixed all-electron atomic basis - the core set is orthogonalized first, the minimal 
valence set is orthogonalized to the core and then within itself, and the polarization set is or- 
thogonalized to the core and valence and at last within itself. Such first-principles integrals, 
however, can give, at best, only a rough approximation to the Hartree-Fock molecular energy 
within a polarized basis, but we are aiming at a parametrizable model that can yield accu- 
rate molecular properties with the electron correlation included in the spirit of Kohn-Sham 
density-functional theory. Thus our model uses parametrized explicit analytical formulas 
for the molecular integrals giving values that differ somewhat from the first-principles ones. 
The two-electron integrals are split into a sum 

of "electrostatics" and "dispersion" parts, the former quickly (exponentially) reaching the 
asymptotics of Coulomb interaction of the two {fiu and /iV) charge distributions and the 
latter having a characteristic r'^ asymptotic tail. For the integrals with one polarization 
function 

Raufi'u' = RaL'u' (21) 



only the electrostatic part is used. The electron correlation can already be modeled by 
decreasing the magnitude of the two-electron integrals. 

The one-electron integrals (IT^ have long-range Coulomb terms V^vfc' and it is much easier 
to work with their short-range analogs 

F,. = H,. + Y1 - K'.v) P,' > (22) 

and in the same way for F^u, where a diagonal promolecule density matrix with constant 
spherically-symmetric atomic populations 

P,^ = Pi^k^ (23) 

is used to add the promolecule Coulomb and exchange terms, so that they fully neutralize 
the core charges 

^Pik = qk. (24) 

I 

It is trivial to rewrite the energy expression in terms of F and R integrals, working with the 
short-range integrals not only simplifies the parametrization but is also of great advantage 
for large-scale calculations. Moreover, some two-electron terms can be accounted for (on 
the average) only within the F integrals and neglected in the R integrals, as we do in 
the following. The (promolecule) one-electron integrals F^^ are of two kinds: one-center if 
= K, and two-center if ^ ky^ but they should also depend on the positions of the 
other atomic centers k ^ k^j,, ky. In our model, we use an additive scheme 

F^u = F^yfl + F^y^k (25) 

where the first leading term is either one- or two-center and the sum runs over the two- or 
three-center corrections, respectively. The leading one-center integrals are atomic constants 

F^iufl '= Sm^,m^Sl^l,^Fi^k^, (26) 

' ='"^ 0. (27) 

The leading two-center integrals 

Ff,ufi " ^mmui'^f^i')^mmA'^f^i')Fml^Lkf,k,{r^iu) (28) 
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as well as Fca,fi can be reduced to functions of interatomic distances and the well-known 
transformation matrices Al^^,{z) of spherical harmonics upon spatial rotation, 

The two-center corrections to the one-center integrals 

ml 

as well as F^jy^k ^-re done in the same way, making further use of the triple products -B^^/^// 
of spherical harmonics. The most complicated are the three-center corrections F^^^k with 
kfx^ kv ^ k that can be exactly reduced only down to nontrivial functions of three variables 
- we have experimented with the triple series in prolate spheroidal coordinates which can 
be made very accurate, but in the end we have chosen the (approximate) factorization of 
the three-center terms 

A 

A 

into the products of two-center terms 



Fp,v,k ^ ''^^kxkSfj.xSi.xfx (31) 



Sfxx ''^Al^^^{z^x)A''^^^{z^x)Smif,ixk^,kx{r^x) (32) 

m 

where A labels the functions of some expansion basis on center k in which the underlying 
operator is diagonal with eigenvalues fx, and ai'c the overlap integrals. The three-center 
terms with one polarization function have minor effect and we set them to zero, 

F^.,k '"^=^' 0. (33) 

Of the two-electron integrals, only the one- and two-center ones are of fundamental im- 
portance and are included in our model, the rest are generally small and are set to zero. 
The one-center integrals 

(0) k^=k.=k^,=k,, (0) (o) . ^ 

have the leading term of atomic constants 

(0) ki,=k^=k^,=k,, y-^ . I l^, l^, 

ml 
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and additive corrections for the effect of surrounding atoms 

^fiu^i'u',k ~ ^m^m^Sm^,m^,Sl^l^Sl^,l^,'^l^l^,k^k{ffj,k), (36) 



for the latter we have also tested the general expression 

^(0) k,,=k^=k_^,=k^,^k ^ ^i' i^, i^, ^^„^ ^, 

mm'll' 

but found only the terms with Z" = Z = Z' = to be important, we also find it safe to set 



^tiiyii'iy',k — 2-^ °rnrni,m^'='m'ra ,m^,'^Qrarn'^Vni'l^J.J^d^,k^,k\> ilk) 



CV.' = 0- (38) 
Of the two-center two-electron integrals, only the long-range ones 

fhfhfh'fh'mlV 

^ ^m-mih-^mm.'ih' ^mWi^ij^, 1^1 k^k^j (r^^' ) , (39) 



?(0) 

center terms and found it safe to set them to zero. We should stress that it is thanks to 



and in the same way R j^niyii are included in the model, we studied the effect of other two- 



the special properties of the underlying zero-bond-dipole [28| orthogonalization of the basis 
functions that all two-electron integrals involving two-center product charge distributions 
0^(r)0^(r), 7^ ky, are small and can be neglected. The two-electron dispersion-model 
integrals are naturally limited to the isotropic two-center terms 

(6) k,=k,^k^,=k^, 

The zeroth-order eigenvalues in Eq. ([7]) can be optimized as atomic parameters, but we find 
it enough to set 

Now that all molecular integrals in our model are defined in terms of constants and 
functions of one variable r, parametrized formulas for the latter are needed. The short- 
range functions of Eqs. (!28l) . (!30|) . (!32l) . and (136!) can be fitted to a high accuracy by the 
general expansion 

n 

f{r; a, Co, ... , Cn) = exp(-ar) ^(ar)'^c«; (42) 

ft=0 
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if enough terms are taken, the long-range functions of Eq. (1391) can be done hkewise with 
one long-range term added 

n 

gi{r; q, a, Cq, . . . , c„) = qr~^~^ U2i{ar) + a^+' exp(-ar) ^{ar)^'"^c^ (43) 

with 

Un{r) = 1 - exp(-r) — (44) 

ml 

m=0 

and the dispersion tail corrections f HOj) can easily be added in the form 



g'^^\r;a,c) = cr ^Ug{ar). (45) 

We have experimented with these general expansions within our electronic structure model 
(working with more than 20000 atomic and atom-pair parameters for a set of 7 chemical 
elements!), but found later that much more compact expressions using sum rules with mostly 
atomic parameters work quite well and yield more meaningful optimized parameter values. 
The short-range terms can be given either by a single exponential (n = 0) term of Eq. ( H2|) 
or by a three-parameter hyperbolic-secant function 

/o(r; a, 6, c) = cexp(— ar + 6) (l + exp(— 2ar + 26)) ^ (46) 



with 6 > 0. For the long-range terms, the leading term of Eq. fl43|) is enough for all multipole- 
multipole and charge-multipole interactions and only a single exponential term needs to be 
added for the charge-charge interactions. 

The sum-rule formulas used in this work are as follows. In Eq. (1281) we have 



Fml^,l,k^,k,ir) - f (r; ^/a^ 0.ml^,L^^^^, C^iCyCml^,l^£^^,£,J) (47) 

with the shorthand notation for atomic parameters = a/^^^, = q^a,.^, and being 
atom-group labels - in particular, = 1 if yU is on H, = 2 if yU is on a second-row atom 
Li-F, and S,^ = ^ for yU on Na-Cl. Thus we work with atomic parameters and diatomic 
atom-group parameters cLmi^i^^f,^^, Cmi^,i^(,^,i^ in Eq. (H7|) . and this is also done in the formulas 
below. For minimal-polarization analog of Eq. fl28|) we choose 
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and it should be understood that, here and below, each integral class has its own set of 
parameters, for example a^, in Eq. f l47|) is not the same as in Eq. fj48|) - the same notation 
is used to save space. In Eq. (I5U]) we choose the form 

Vii.i^kA'^) = fo (r; + aii^j^^^^^^^ , b^^^^ , i(c^ + c^)ckCii^i^^A (49) 

\ + + Zak J 

for both minimal-minimal and minimal-polarization integrals. In Eq. ( I32l) we have 

'5mi^«AfcMfcA('^) = /o ( -^-^^ Ki;x^Cf,CxCml^l^i:^iA . (50) 



a„ + OA ■ 



In Eq. (15^ we have 



1,^' a, + a,,+2a. ' ^^"^^ ^^^^ + j . (51) 



Vi^i^^k^r) = fo{r; 
In Eq. (l39|l the form 

^ / X 2(a/. + Q^)(V + a^') ic c \ /„n 
'^mii'i^i^i,i^,kf,k,[r) — gi+i' I r; qiif,i^qi'i,i^,qmii', : : : , -2"oim' [^^) 

is used for both 4-minimal and l-polarization-3-minimal two-electron integrals, with the 
atomic multipoles qa^i^ and the fundamental constants of multipole-multipole interaction 
qmii', in the charge-charge 4-minimal case one exponential term is added. In Eq. fHOj) we 
have 

As can be seen, the atomic a-parameters in all these formulas play the role of radial scale fac- 
tors, the atomic c-parameters are multiplicative prefactors, and the atom-group 6-parameters 
control the shape of functions at shorter distances r. 

C. Parametrization procedure 

We parametrize our model by minimizing the function 

S = SE + £F + SD+£g + £H + £d + Sa + £g + Su + Se + S, (54) 

that measures the deviation of molecular properties predicted by the model from the refer- 
ence values computed by the higher-level theory on a set of molecules (molecular geometries). 
The energy term 

£e = ^wf, Cmn{Em - E.^) I , (55) 

n \ m / 

13 



with m = 1,...,M labeling each molecule, can be used not only to fit each predicted 
energy Em to the reference value Em in the trivial case Cmn = ^mn, but also for giving higher 
weights to some chemically-meaningful energy differences, such as conformational energy 
changes, reaction energies and activation barriers. 
The effective one-electron Hamiltonian terms 

^F = 5^ti;^tr{(F^-F™)2} (56) 

m 

and the density-matrix terms 

£d = 5^ w;^tr{(D„ - D„)2} (57) 

m 

for each molecule take from Eq. f|T7|) and F^ from the theory of Section III Dl below, the 
density matrices and come from the diagonalization of F^ and F^- One may put all 
= and work with only ^ 0, assuming that an accurate F-matrix should also yield 
an accurate D-matrix, but in practice we have not-so-small errors in F-matrices and find it 
a better compromise to balance between F and D terms, thus giving more importance to 
the occupied-occupied and occupied- virtual blocks of F. 

The energy first and second derivatives with respect to atomic coordinates for 
each molecule m are gathered into the terms 

^9 = XI ^m(gm - gm)'^W^(g„ - g^), (58) 

m 

= J^t^^tr I ((H^ - H^)W^)'| , (59) 

where the weight matrix 

W= (h2 + /.2i)-^/2 (60) 

is a well-behaved positive-definite replacement for the inverse H~^. This weighting scheme 
emphasizes the weak modes and, in our experience, it drives the model towards a balanced 
reproduction of potential energy surfaces. We set = 2~^^ in this work, close to the lowest 
eigenvalue of H for the water dimer (H20)2 used as a prototype. For reference geometries 
at stationary points (gm = 0), as is normally the case, Eq. ( l58l) can be seen as a weighted 
sum of squared distances between the stationary points estimated by the model and the 
reference, whereas Eq. (l59l) is a weighted sum of squared dimensionless relative deviations 
of force constant values. 
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Electric dipole moments and polarizabilities 



dm dm 



m 



as well as atomic polar tensors 



^9 = X^^m tr{(qm - qm)^(qm - qm)} 



(61) 
(62) 

(63) 



are included into the optimization in a straightforward way, as are the molecular electrostatic 
potentials 

= E<E (^(^'^) - U{r.)ys. (64) 

m K 

on the grids of points r^^ at discretized surfaces with elements (see also Section [11 Fp . 

Molecular dispersion coefficients (discussed in Section fll El below) are treated in the form 
of square roots 

^6 = Y1 (cl% - Cefm) (65) 



as these have a more natural system-size scaling. 

The last (optional) term is a sum of soft constraints 



K \ p 



(66) 



(x — a)^, X < a 
s{x,a,b) = I 0, a < X < b 

{x — by, b < X 
on linear combinations of optimization parameters {xp}. 



(67) 



D. Effective Hart ree-Fock- like Hamiltonian from correlated wavefunction theory 

Given the one-electron nonidempotent density matrices "P^i^o- from a correlated wave- 
function calculation, with fi, v labeling (in this Section only) the functions of an extended 
atomic basis, we set up our Hartree-Fock-like self-consistent field procedure based on the 
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minimization of the energy expression 

filler fiucrfi' u' a' 

+ f E i^i"^^^ - ^^-) (^ivl - ^^.vv) (68) 

with respect to the idempotent density matrices D^^^. Eq. dHHD is a sum of the Hartree- 
Fock energy and a quadratic density penalty function weighted by w, the associated effective 
Hamiltonian matrix 

fi'i/'a' fi'u' 

is a sum of the Fock matrix and a local spin-dependent correlation potential matrix, the 
latter arises as the scaled Coulomb potential of the difference spin density. If a complete 
basis were used, the limit — ?■ oo, if exists, should yield the exact effective Hamiltonian 
of Sec. IIB of Kohn and Sham's work lOj. In practice, we work with a finite incomplete 
basis and a reasonable finite value of w should be chosen. Our procedure can be seen as 
a nonlocal-exchange-local-correlation analog of the local-exchange-correlation procedure of 
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E. Molecular dispersion coefficients 

For two molecules at a large distance r between their centers, the leading asymptotic 
term of the dispersion part of the intermolecular potential is Cgr"^ with cg being a function 
of the relative orientation. The orientational average of cq is a bimolecular constant that 
can be most easily computed within MP2 as 

^ 2 \{(f>aa\r\(l)ia)f\{(l)a'a'\r\4)i,„,)f ^^^^ 

aiu a'l'ar' 

from the dipole moment integrals over one-electron wavefunctions and the one-electron 
energies e, z's label the occupied and a's the virtual states each localized on one of the 
molecules. Much more complicated expressions can be derived for higher-order correlation 
methods, but we will use Eq. ( ITOll in this work as it gives accurate enough values for our 
purpose. We take only the case of two identical molecules and such cg becomes a molecular 
constant of interest, moreover, we have seen that a heteromolecular cq is quite close to the 
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geometric mean of two molecular constants. Within our parametrizable model the value to 
put into Eq. f l65|) is simply 

= Yl ^mm-Cm (71) 

with as in Eq. (1551) . By construction, Eq. fl5^ always gives isotropic Cq values that follow 
the rule of geometric mean. 



F. Surfaces for sampling molecular electrostatic potentials 

A smooth surface is preferable for sampling the molecular electrostatic potential, so we 
have tried the isodensity surface p(r) = po and found that po ~ 10^^ should be used to 
enclose most of the density as needed. While this may work well for the exact density, 
the one calculated within a limited finite basis approximation may not be accurate enough 
at such small values in the tail region, indeed we have seen some rather weird shapes for 
molecules as simple as LiF. 

In this work we set a new definition of molecular surface p{r) = po with 



p(r) 



J s{r' - r)pi(r', r")s{r" - r) d=^r' d^r" (72) 



in terms of the one-electron density matrix pi(r, r), the density being its diagonal part p(r) = 
pi(r, r), as inspired by the analysis of the exchange repulsion effects. For the broadening 
function s(r) localized around r = we make the simplest choice 

s(r) = cexp (a|r|^) , (73) 

in the limit a oo with c = (a/vr)^/^ Eqs. ([72]) and ([73]) yield p(r) p(r). With c = 1, 
a = 1/16, and po = 1/4 we get good surfaces for the molecules studied in this work. The 



surface discretization is best done with the spherical quadrature rules 



56 



57| 



III. CALCULATIONS 



The training set of 5581 molecules used in this work covers a broad range of structures 
built up from 15 elements H, Li-F, Na-Cl, there are both energy-minimum and transition- 
state geometries of neutral, cationic and anionic species with closed-shell and (high-spin) 
open-shell electronic configurations dominated by a single determinant. We have spent a lot 
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of time setting up this database guided by our chemical intuition, finding meaningful struc- 
tures of (almost) all chemical compositions with up to three nonhydrogen and any number of 
hydrogen atoms, and less thoroughly for up to eight nonhydrogen atoms. We tried to sam- 
ple all kinds of chemical bonds - from metallic through covalent to ionic, hydrogen bonded 
dimers and clusters (for example, neutral, protonated, and deprotonated water clusters with 
up to 6 O atoms) as well as weaker intermolecular complexes are also carefully chosen. 

The reference data were generated by the CCSD method [l^ with nonrelativistic Hamil- 
tonian and correlation-consistent atomic basis sets 58[ of sizes shown in Table [H The LI 



set has one set of valence polarization functions, the Lll set is for a correlated treatment 
of (outermost) core shells, and the Ll+1 set has diffuse functions added as Rydberg shells - 
this is the entry-level basis for quantitative CCSD calculations, we would have preferred the 
next-level (L2, L22, L2+1) had we had enough computer power. An archive of all molecular 
data files is available for download jsol . 

The model is defined by the energy expression of Section III Al the molecular integrals of 
Section [ilBl and the atomic basis set sizes given in Table HTl with one radial function for each 
angular symmetry. Besides the full model, 8 simplified models with some classes of integrals 
set to zero are also analyzed. All parameters of each model have been simultaneously 
optimized on the full reference data set, the weights assigned to each molecular property and 
the values of each error term in Eq. flMj) at convergence are given in Table IIIIl full details of 
the settings along with the optimized parameter values can be found in the attached files 59 |. 
Our optimization algorithm computes numerically all first derivatives of molecular properties 
with respect to the parameters and uses linear searches along the optimal direction to find 
the nearest local minimum, the choice of the starting guess is so far from straightforward 
that it cannot be documented here. 

It is not trivial to judge the quality of such a model by the net error terms, but still 
some insight into the role of the whole classes of parametrized molecular integrals can be 
gotten as they are removed from the model in the order of increasing importance. As 
expected, the dispersion terms ( HOl) have the smallest impact as shown by Model 2. The 
role of the polarization basis functions is dramatic - Model 4 that has them all removed 
nearly doubles the most important error terms, and about 2/3 of this effect come from the 
one-electron integrals (|28|) and fl30|) . Model 3 with only the two-electron integrals (|39|) over 
the polarization functions included is the simplest one to properly account for the molecular 
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dipole polarizability but this does also improve the other molecular properties. Starting 
from Model 4 we can study the role of the integrals over the minimal basis that are new 
to our model. The two-center corrections to one-center two-electron integrals fl36p are the 
least important, though Model 5 without them shows already about 1.5 greater values of 
most error functions. Further degradation is caused by zeroing out the multipole moments 
in the two-center two-electron integrals (l52|l as in Model 6 and, further on, in the one-center 
two-electron integrals f p5|) as in Model 7. At this point. Model 7 is already too simplified to 
support the burden of the first (!58l) and second (!59|) derivative terms as they were before, so 
we need to relieve it by setting a larger value of in Model 7' to get ready for the last strike. 
A dramatic degradation is seen as the three-center corrections to two-center one-electron 
integrals flHTl) are zeroed out in Model 8, and yet again as the two-center corrections to 
one-center one-electron integrals (|30|) are dropped in Model 9. With this overview, it should 
be clear that these integral terms have not only the quantitative but also the qualitative 
impact on the performance of the model - the simplified models may even fail to reproduce 
the existence of some stationary point on molecular potential energy surfaces, for example, 
Model 6 predicts the symmetry of the hydrogen peroxide molecule H2O2 to be C2h instead 
of C2, and Model 8 gives the cyclobutane molecule C4H8 the symmetry instead of the 
right D2d - the reliable prediction of intra- and intermolecular conformations is a challenge 
that only our full model seems to meet. 

A binary executable code of our program for Linux/x86_64 architecture is made avail- 
able to the interested researchers worldwide for further testing and some preliminary 
applications of the new model. 

IV. CONCLUSIONS 

We are pleased to have developed the new parametrizable electronic structure model 
that both has a number of good formal properties and performs well in numerical tests. 
It evolved through a trial-and-error process until it has reached a satisfactory level of ma- 
turity. With this parameter set for 15 elements, many interesting systems can already be 
studied, we are going to see how it works for the structure prediction of biomolecules and 
the chemical processes in condensed phase. If some weaknesses will be found, the model can 
be reparametrized on a more representative training set. 
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Future work may include: building a comprehensive database of prototype molecular 
structure for further training and testing of the model, its extension to heavier elements, a 
reparametrization based on more accurate reference data from CCSD or CCSD(T) calcula- 



tions with a next-level basis, the investigation of linear-scaling techniques |60H65| for solving 
the SCF equations of our model to find a both accurate and fast algorithm for studies of 
large systems. A formalism to treat excited electronic states dominated by single excitations 
from a single determinant can also be worked out, either within the configuration interaction 
with single substitutions (CIS) or within the linear response theory. 
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TABLE I: Atomic basis set sizes for CCSD calculations. 



Atom Set Core Contracted Primitive 

H LI 2,1 8,4 

Li, Be Lll 4,3,1 12,8,4 

B, C LI 1 3,2,1 12,8,4 

N, O, F Ll+1 1 4,3,1 14,10,4 

Na, Mg Lll 1 5,4,2 18,13,8 

Al, Si, P LI 2,1 4,3,1 18,13,5 

S, CI Ll+1 2,1 5,4,1 20,15,5 
Number of radial functions for each angular quantum number is given. 
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Atom 



TABLE II: Atomic basis set sizes in the parametrized model. 



H oil 

Li, Be 112 

B, C, N, O, F 1 2 2 

Na, Mg 112 

Al, Si, P, S, CI 1 2 2 



polarization I" 
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TABLE III: Errors in molecular properties from the parametrized models. 



Integral tcrm'^ 
(l>0) 

^(0)(fej,=fe„=fe^,=fe„,#fc) 
fii^fi' v' ,k 

p{k^ = k^^k) 
aVjk 

(0)(fc„=fc„7^fc,,/=fc„/) 
^ / / n 

Number of parameters'" 



Error term'^ 

Sf 
£d 

Sh 

Sa 
£6 



Eq. Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 7' Model 8 Model 9 

+ 



Eq. 
| [55ll 
| [55ll 
| [56ll 
1(53 
| [58ll 
| [59ll 

lED 

l(62)l 
| [63ll 
l(64)l 
| [65ll 
| [60ll 





+ 


+ 


+ 


+ 


+ 


+ 


+ 


+ 


lED 


+ 


+ 


+ 


+ 


+ 


+ 


+ 


+ 


|[35ll 


+ 


+ 


+ 


+ 


+ 


+ 






|[52ll 


+ 


+ 


+ 


+ 


+ 








|[36ll 


+ 


+ 


+ 


+ 










|[28ll 


+ 


+ 














13011 


+ 


+ 














13911 


+ 


+ 


+ 












lioli 


+ 




+ 


+ 












720 


690 


524 


513 


434 


406 


378 


378 



w 

2l3 

222 

2-2 

1 

2^ 

1 

2-3 
2-10 

1 

23 

2-3 

ho 



975.8 
228.1 
998.3 
267.2 
3194.3 
1961.9 
7.5 
62.6 
196.1 
324.3 
85.4 
2-10 



1013.0 
255.1 
989.6 
283.1 
3255.2 
1987.2 
7.6 
58.0 
199.9 
324.5 



1718.5 
497.8 

1131.8 
522.7 

5067.6 

3398.4 
11.7 
70.0 
289.3 
332.2 

200.4 

2-10 



2206.1 
562.6 

1190.0 
552.3 

5978.6 

3957.8 
22.6 

477.6 
331.6 

284.2 

2-10 



3307.7 5006.1 5940.6 

1022.6 1459.8 1662.8 

1013.9 1107.5 1266.4 

675.7 771.5 1042.8 

9712.6 12905.1 13887.4 

6313.1 7774.1 9329.4 

31.5 56.9 63.6 

615.7 661.4 782.9 

305.9 262.3 255.6 



591.4 
902.5 
720.9 
3432.9 
1872.7 
47.2 



264 



185 



4498.5 8176.7 12026. ( 



3010.9 
1684.5 
1318.9 
5439.0 
2990.5 
86.7 



4251.5 
1811.7 
1115.3 
8111.3 
4356.6 
119.7 



669.1 1315.5 1258.3 
278.5 263.5 232.2 



"The integral terms of the Eqs. given are included in the model if marked by the sign, otherwise set to zero. 
''The number of independent parameters optimized for each model. 

'^The error terms of the Eqs. given with '^the assigned weights and "^the ho values of Eq. I|60|l are listed (in au) for each model 
if included in the optimization, otherwise assigned zero weight and marked by "— ". 
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JCP Manuscript All.06.0280 

Title: A new parametrizable model of moleculeir electronic structure 
Author: Dimitri N. Laikov 

Discussion of Reviewers' Comments and Changes made to the Mciniscript 

Many thanks to the Reviewers for the careful reading of the Manuscript and the valuable 
comments and suggestions. I have an impression that my work has excited a lot of interest 
- the Reviewers seem to be eager to see more details on the performance of the new model, 
so they both ask for more tests of the model's accuracy to be done and reported right now. 
I strongly believe that such extended tests would make up a very good report publishable 
separately, not to say that I have limited time left would I have wished to do it now, 
moreover, I would be delighted to see such testing done by a third party. I find a good 
way to resolve this problem - the Supplementary Information will now include the binary 
executable file (single-threaded only) of my computer code for the most common machine 
architecture (Linux/x86_64) that can be used by anyone in the world to run molecular 
calculations with the new model - geometry optimizations of stable and transition states, 
harmonic vibrational analyses, intrinsic reaction coordinates - sample input files are also 
included and an input description is added to the README file. On the other hand, the 
Reviewers found no flaw in the most fundamental part of the work - the mathematics of 
the new model set forth in the 73 Equations of the Manuscript. The now highly cited 1963 
paper of Pople, Santry, and Segal (Ref. 35) comes immediately to my mind - it included 
only the mathematical equations and no numerical tests at all. 1 do include the Calculations 
section to show that the new model is indeed parametrizable and that the new terms added 
to the energy expression of the model lead to great improvements in the computed molecular 
properties. 

I have revised the Manuscript to improve its quality. The Reviewers' comments follow 
below (in typewriter font) with my answers and an account of the changes made. 

Reviewer #1 Evaluations: 

RECOMMENDATION: Publish in JCP with mandatory revision (minor) 
See attached file. 
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[the text extracted from file "l_reviewer_attachment_l_1311800956.pdf" follows] 
Referee report on All. 06. 0280 

Title: A new parametrizable model of molecular electronic structure 
Author: Dimitri Laikov 

In this paper, a heavily parametrized semi-empirical model 
is presented with significantly better accuracy than classic semi- 
empirical approaches. 

I find this topic should be interesting to the JCP readers and when 

properly presented would deserve a publication in JCP. 

However, the paper should be significantly improved before acceptance. 

1. My major issue is the quality of the Section III. The section lacks 
depth. This section is most important for users of the model, yet 
a lot of details are not explained. 

As I noted before, the Calculations (Section III) are of less importance than the Theory 
(Sections II) in this work, but I have made some changes and additions to address this issue 
(see also below). 

The only major data of the paper in Table III are poorly presented. 

It took me some time to understand that these are weighted errors 

multiplied by "w" factor but only after digging into the details 

of the Supplementary Info knowing some of the results I guessed 

the energy units are hartree. The table should be described in depth, 

the units explained and all the symbols either explained in the caption 

or related to the relevant section in the text. 

Footnotes are now added to Table III to explain the data, the weighted errors £e, ^f, ^d, 
Sg, Sh, Sd, Sa, Sq, Su, Sq in Table III were meant to be those defined in the text, but to 
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make it clearer one more column is now added to the table with equation numbers for each 
of these terms, the use of atomic units is stated. 

2. Nevertheless the raw errors for the training set are still not very- 
useful for 

a general reader. The data should be related to some other methods 

and the simplest way to achieve it is to utilize some precomputed 
databases. I recommend the GMTKN30 database [J. Chem. Theory Comput., 7, 
291-309 (2011)] 

with the data available on this webpage: 

\protect\vrule widthOpt\protect\href {http : //toe . uni-muenster . de/GMTKN/GMTKN30/GMTKN30mai 
The database is comprehensive, well organized and the 
geometry information is machine readable, however, 
other databases can also be used. The complete 

database would be the most useful but a relevant subset covering 

a wide range of chemical problems should also be sufficient. Mean errors, 

mean absolute errors and maximal errors would show bias, average accuracy 

and 

worst case accuracy of the method and could be compared to some DFT 
or wavef unction-based methods. Also, it would show how well the method 
is transferable to systems not used in the training set. The errors 
in the reaction energies from the Supporting Information are very good 
indeed, 

usually below 1 kcal/mol. Of course, some deterioration of the accuracy 
is 

inevitable, partly due to rather limited accuracy of the training set, 
however it would show how useful is the method at this stage. 

Thanks to the Reviewer for drawing our attention to the GMTKN30 database built up 
by the outstanding researchers from the Organic Chemistry Institute of the Westfalische 
Wilhelms-Universitat Miinster, Germany. I have visualized all 1218 molecular geometries 
from the GMTKN30 database - ihre Bilder habe ich vor mein geistiges Auge gestellt - and 
my trained eye tells me that there is likely less diversity in that set than in my own training 
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set of 5581 molecules! Yes, there are some bigger molecules among those 1218, but they are 
built up from some rather common fragments. As I noted before, the Calculations (Section 
III) of my work were intended mainly to show the parametrizability of the new model and 
the more extended tests should be done and reported separately, preferably by a third party. 
The first-generation parametrization of the model reported in this work is based on the 
reference data from the CCSD method with an entry- level basis set, so I would expect the 
atomization energies (and even the bond-breaking reaction energies) to agree rather poorly 
with any complete-basis-set extrapolated CCSD(T) or experimental thermochemical data. 
Other molecular properties - first of all the geometries, but also the concerted reaction 
energies, proton affinities, and hopefully even the harmonic vibrational frequencies may be 
already quite meaningful. 

I decided to do much more than just the benchmarking of the model on some well- 
established molecular database - I provide a binary executable code of my program (in 
the Supplementary Information) that can be used by anyone in the world to do molecular 
calculations with the new model such as (but not limited to) the testing and benchmarking 
of the model's accuracy, and I am looking forward to get any useful feedback, either privately 
or in a published form. 

3. The README in the Supporting Information should also be extended. Some 
of the information in the files is self-explanatory 

(e.g. errors) but I find others (e.g. some info from the files in the mol 

directory) 

very cryptic. 

The README file is now greatly extended to give detailed information on the format of 
the data files. 

4. The flow of the description in Section II is chaotic. Some 
of the symbols are defined long after they are first used which 
greatly impairs the reading. For instance, after Eq. (3), 
there is a discussion of E_2 term before all the symbols in 

Eq. (3) are explained. Moreover, section IIC uses some 
symbols that are only defined in sections IID and HE. 
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I find no better way to write the equations of the Theory (Section II). May the Readers 
forgive me this imperfection, although I find it quite natural to explain some terms later, 
this seems to me to be the best compromise. 

5. How were the parameters optimized? With non-linear parameters, there 
are 

several choices of methods. How the optimization was guided not to get 
stuck in a local minimum? 

I have added one sentence about the optimization algorithm to Section III. Needless to 
say that for such a heavy optimization problem I have no means to make global searches, 
I can only hope to be lucky enough to find the global minimum by intuition and painful 
trial-and-error work, or at least to find a good local minimum! 

6. It would help the readers if the "lengthy but straightforward" 
expressions on pages 7 and 8 were put in the main text or the 
Supporting Information. 

I think that even an average undergraduate student in physical sciences should be able to 
derive these expressions - the derivatives of the analj^ical formulas with only additions, 
subtractions, multiplications, and divisions, however lengthy they may be. 

7. Is the computer code described on page 8 going to be distributed? 
If the author plans to distribute the code, the reference in the paper 
would help the readers finding the code. 

1 put a single-threaded binary executable code into the Supplementary Information, along 
with an input description and sample input and output files. Those who are interested in a 
deeper work with the code may try to contact me personally for a collaboration. 

Reviewer #2 Evaluations: 

RECOMMENDATION: Reconsider for JCP after mandatory revision (major) 
Reviewer #2 (Comments to the Author) : 
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The manuscripts reports an impressive amount of work that is aimed 

at the development of a new general-purpose parametrizable model 

for electronic structure calculations. Key ideas are the use 

of a minimal basis set (MBS) of orthogonalized atom-centered basis 

functions for the SCF treatment, the inclusion of polarization effects 

in the spirit of second-order perturbation theory, parametrized expressions 

for the one-electron and two-electron integrals (with the neglect 

of many small terms) , an elaborate parametrization procedure 

that covers many different properties and employs CCSD reference data, 

the extraction of an effective MBS Hamiltonian from CCSD calculations, 

and the inclusion of dispersion effects into the two-electron integrals 

via an MP2-related procedure. The parametrization is carried out 

for a diverse and comprehensive basis set of 5581 molecules, 

and overall error measures for the molecular properties are reported 

for the recommended full model and 9 simplified variants. 

The EPAPS material contains more than 879,000 lines of computer-generated 

data. 

In the Introduction, the author gives a nice general overview over the field. 
In spite of the need to be selective, he should cite and briefly discuss 
the work of K. F. Freed who has written more than 20 papers between 1972 
and 1996 on the derivation of semiempirical MBS methods (in particular 
of the MBS integrals) from rigorous ab initio theory, in a spirit that is 
somewhat similar to the present work; leading references could be 
Acc. Chem. Res. 1983, 16, 137 (an early review) and JCP 1996, 105, 1437 
(one of the latest papers) . 

Thanks to the Reviewer for making me aware of Prof. Karl F. Preed's work, now I have 
read his key publications and I have added 4 citations of his work (Ref. 31-34). I believe it 
is a good idea to cite that work, although those model Hamiltonians are of a different kind 
- to be used in a fully correlated theory (full CI or an approximation thereof). 

The statement on page 4 that the orthogonalization-corrected methods 
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from ref 38-39 "did not lead to much improvement" over previous 
semiempirical methods also needs to be corrected, for detailed evidence 
see the following publications: JPCA 2007, 111, 5751; JCTC 2010, 6, 1546; 
JCTC 2011, 7, online (dx.doi.org/10.1021/ct200434a). 

I have now corrected that statement to " lead only to a limited improvement" which I beheve 
should be more diplomatic. 

I have read the abstracts of the suggested publications: 

1. Looking at Self- Consistent- Charge Density Functional Tight Binding from a Semiem- 
pirical Perspective, Nikolaj Ottc, Mirjam Scholten, and Walter Thiel, J. Phys. Chem. 
A, 2007, 111 (26), pp 5751-5755. 

2. Benchmark of Electronically Excited States for Semiempirical Methods: MNDO, AMI, 
PM3, OMl, 0M2, 0M3, INDO/S, and IND0/S2, Mario R. Silva- Junior and Walter 
Thiel, J. Chem. Theory Comput., 2010, 6 (5), pp 1546-1564. 

3. Benchmarking Semiempirical Methods for Thermochemistry, Kinetics, and Noncova- 
lent Interactions: OMx Methods Are Almost As Accurate and Robust As DFT-GGA 
Methods for Organic Molecules, Martin Korth and Walter Thiel, J. Chem. Theory 
Comput., Article ASAP. 

I find Paper 1 to be of little relevance to my work as it deals, first of all, with a different type 
of models (SCC-DFTB) that I am not wiUing to criticize. Paper 2 deals with electronically- 
excited states and is thus also from a different field. Paper 3 may be indeed interesting, 
although it deals only with molecules built up from 4 elements H, C, N, and O, moreover, I 
have no full-text access to JCTC at my institution now, asking a friend in the US to send 
me a copy of that paper would have taken some time because he is in vacation. Thus I 
cannot cite any of the 3 Papers. 

The Theory section contains many interesting and novel ideas. 
The proposed formalism clearly goes beyond the currently available 
semiempirical schemes and accounts for a number of further interactions. 
Having said this, one should also point out that this formalism still 
employs many approximations and neglects many terms (which is mandatory 
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when aiming for a very fast electronic structure method). For example, 

the method still uses a minimal basis, polarization is treated 

at second order with some empirical damping, numerous approximations 

are adopted for the molecular integrals (see II. B; use of additive schemes 

and factorizations and setting small terms to zero and fitting 

the remaining terms, etc), there is a fairly extensive parametrization, 

and the reference ab initio methods are CCSD with a rather modest basis 

(in general) and MP2 (for modeling the leading dispersion terms) . 

I find many of the methodological ideas interesting, but from the description 

in the manuscript, it is very difficult for me to judge the soundness and 

quality of the approximations in the proposed scheme. Generally speeiking, 

the reader would like to see more detailed justifications on the choices 

and approximations made; however, I do not see how this can be done easily. 

One possibility might be to give some numerical examples and integral plots 

for prototypical interactions and small molecules, but I admit that this 

could easily grow into a separate paper. 

I have though a lot in the last years on the foundation and justification of parametrizable 
electronic structure models and made many numerical studies on the behavior of the under- 
lying molecular integrals that can be computed in some way from first principles. 1 decided 
not to go into any such details in this work as it would have easily overwhelmed everything 
else to a point that hardly anyone in the world would have ever been able to read the whole 
text! The Reviewer admits rightly that "this could easily grow into a separate paper". 

The Calculations section summarizes massive computational work 
on the parametrization of 15 elements: H, Li-F, and Na-Cl. 
The accuracy of the results will be limited by the use 
of CCSD/TZVP-type reference data (on top of the other approximations, 
see above) . The quality of the results is addressed only very summarily 
by giving overall errors for various properties (Table III) . This is done 
in a way that baffles me. Even if I assume that the data in Table III 
are given in atomic units (not specified) , I do not really understand 
what the numbers mean. I do not see answers to the following simple 
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questions: What is the typical error of the best (full) model 

for energies, geometries, dipole moments, polarizabilities, etc? 

Are the errors fairly uniform for different elements or not? 

How does the method perform for standard benchmark sets such as G2 and G3? 

I have improved Table III according to the comments of Reviewer #1, the use of atomic units 
is now noted. The main point of Table 111 is to show how the accuracy of the model changes 
as the new terms are added or removed. The Reader can see the typical errors for energies in 
the listings (file "energy.dat") found in the Supplemetary Information, the weighted errors 
for all molecular properties studied for each molecule are also listed (file "errors.dat"), these 
data can be easily imported by the interested Reader into a spreadsheet and analyzed in 
many ways! Because this first-generation parametrization of the new model is based on the 
reference data from CCSD calculation with an entry-level basis set, the atomization energies 
would agree quite poorely with the experimental data, but the more chemically-interesting 
reaction energies should already be meaningful. I find the reproduction of molecular geome- 
tries, especially for weakly-bound intermolecular complexes, to be a greater challenge than 
even the accurate total energies for typical molecules, so my parametrization functional of 
Section IIC is constructed to give these properties a greater boost, as discussed in the text. 

To let anyone see how the model works for any molecular system of interest, I have now 
included the binary executable code of my program into the Supplementary Information, as 
noted above in the answer to the comments of Reviewer #1. 

The EPAPS material is also essentially inaccessible to me. I appreciate 
that the almost 880,000 lines of computer-generated data might contain 
all the answers to my questions, but I am afraid that nobody other than 
the author will be willing to check and digest this material. 

I have now added a detailed description of the file formats so that these data can be easily 
analyzed - if not by hand then using some simple parser program that even an average 
undergraduate student in physical sciences should be able to write if needed. To visualize 
all 5581 molecular structure of the training set one can run, for example, this command 
(under UNIX bash shell in an X- Window terminal) in the data/ directory: 

for m in mol/*.in ; do bin/xm $m ; done 
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The binary of xm viewer is provided in the Supplementary Information. 

What to do with this paper? On the one hand, I see potentially interesting 
ideas, but on the other hand, my dominant feeling is that I cannot really 
judge the quality of the approximations made nor the quality of the results. 
In my opinion, it would be a disservice to the author to publish 
this potentially important manuscript as is. 

I find this statement too pessimistic. Ideally, to "really judge the quality of the approxima- 
tions made" and "the quality of the results", an independent reproduction of all the results 
of my work would be needed, but this cannot be done quickly (in a few months) even by 
the bravest worker in this field. I fear a much greater "disservice to the author" would have 
been if the Manuscript were rejected on the maximalist grounds. 

I suggest major revision. Apart from the explicit points mentioned above, 
the author should be required to improve the manuscript with regard 
to the following: 

(a) better justification of the approximations made; 

This point is already discussed (see above). 

(b) better documentation of the quality of the results, including 
a statistical evaluation at least with regard to the G2/G3 and S22 
benchmark sets, and preferably also with regcird to other established 

benchmarks ; 

My training set of 5581 molecules is so diverse that these benchmark sets would hardly 
show anything new. Most molecules of the G2 set are already found in my set and it is 
nearly so for many other benchmarks. The traditional semiempirical models have rather 
few parameters and used to be parametrized on a small training set and then an extensive 
testing was very much needed to see their reliability. The situation is very different in my 
case where I aim at the highest imaginable diversity already in the training set. 

(c) some brief information on practical matters, e.g., cpu times relative 
to standard semiempirical and DFT methods; 
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The CPU times would depend on the implementation, but this model should be only 
marginally slower (less than twice) than the traditional semiempirical models, and this can 
be seen simply from the equations of the model, where the bottleneck is nearly the same 
linear algebra with the matrices of the same minimal-basis dimension. 

(d) adding a detailed table of content and detailed explanation 

to the EPAPS material so that readers can appreciate the documented results. 

This is done as noted above. 

I hope to have made the important changes to improve the quality of the Manuscript. 
Many thanks to everyone for the attention. 
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